-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(pgwire): unify query logs and slow query logs #13924
Conversation
Signed-off-by: Bugen Zhao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #13924 +/- ##
==========================================
+ Coverage 68.00% 68.07% +0.06%
==========================================
Files 1542 1543 +1
Lines 266905 266920 +15
==========================================
+ Hits 181520 181714 +194
+ Misses 85385 85206 -179
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Good Catch
/// Newtype wrapper for `DefaultFields`. | ||
/// | ||
/// `fmt::Layer` will share the same `FormattedFields` extension for spans across | ||
/// different layers, as long as the type of `N: FormatFields` is the same. This | ||
/// will cause several problems: | ||
/// | ||
/// - `with_ansi(false)` does not take effect and it will follow the settings of | ||
/// the primary fmt layer installed above. | ||
/// - `Span::record` will update the same `FormattedFields` multiple times, | ||
/// leading to duplicated fields. | ||
/// | ||
/// As a workaround, we use a newtype wrapper here to get a different type id. | ||
/// The const generic parameter `SLOW` is further used to distinguish between the | ||
/// query log and the slow query log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, what part of the following log is controlled by this? (What does it look like without this?)
2023-12-12T16:43:36.017757+08:00 risingwave-main ThreadId(10) handle_query{mode="simple query" session_id=0 sql=select fu;}: status="err" time=68.853375ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- handle_query
+ <some control character for ANSI color>handle_query
- {mode="simple query" session_id=0 sql=select fu;}:
+ {mode="simple query" session_id=0 sql=select fu; sql=select fu; sql=select fu;}:
status="err" time=68.853375ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it's colorful: https://docs.rs/tracing-subscriber/0.3.18/src/tracing_subscriber/fmt/fmt_layer.rs.html#815-823
Why sql
is recorded multiple times: https://docs.rs/tracing-subscriber/0.3.18/src/tracing_subscriber/fmt/fmt_layer.rs.html#851-854
Notice the type parameter N
here.
/// The span is only effective if there's a current session and the message is | ||
/// query-related. Otherwise, `Span::none()` is returned. | ||
fn root_span_for_msg(&self, msg: &FeMessage) -> tracing::Span { | ||
let Some(session_id) = self.session.as_ref().map(|s| s.id().0) else { | ||
return tracing::Span::none(); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is there no session? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess when the Startup
message is not handled?
@@ -460,26 +554,12 @@ where | |||
async fn process_query_msg(&mut self, query_string: io::Result<&str>) -> PsqlResult<()> { | |||
let sql: Arc<str> = | |||
Arc::from(query_string.map_err(|err| PsqlError::SimpleQueryError(Box::new(err)))?); | |||
let start = Instant::now(); | |||
record_sql_in_span(&sql); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a little confused. Does this duplicate with the layer in do_process
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the layer in do_process?
What do you mean by this? 😳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, do_process
emits the log line, but process_query_msg
adds the sql field. I see.
/// The const generic parameter `SLOW` is further used to distinguish between the | ||
/// query log and the slow query log. | ||
#[derive(Default)] | ||
struct FmtFields<const SLOW: bool>(DefaultFields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the SLOW
used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
risingwave/src/utils/runtime/src/logger.rs
Lines 303 to 306 in 9dcad74
let layer = match is_slow { | |
true => layer.fmt_fields(FmtFields::<true>::default()).boxed(), | |
false => layer.fmt_fields(FmtFields::<false>::default()).boxed(), | |
}; |
There's no actual use for this flag. It just makes the compiler to treat them as different types and then assign them with different type_id
s, which is used for RTTI.
src/utils/pgwire/src/pg_protocol.rs
Outdated
tracing::info!( | ||
target: PGWIRE_QUERY_LOG, | ||
status = if result.is_ok() { "ok" } else { "err" }, | ||
time = ?elapsed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, shall we print the elapsed
in a fixed width and fixed unit format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more? I'm afraid it's hard to do this for the duration from 50 µs = 0.05 ms
to 5 min = 300 000 ms
. 🥺
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ms
is enough for batch query. How about using .as_millis()
? With the same time unit could let us easy to grep these lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I feel that "unifying query logs and slow query logs" doesn't seem to reduce the complexity but somehow make it more complicated to me... Tend to keep the original implementation, which was simple and straight-forward.
@@ -160,6 +158,17 @@ pub fn cstr_to_str(b: &Bytes) -> Result<&str, Utf8Error> { | |||
std::str::from_utf8(without_null) | |||
} | |||
|
|||
/// Record `sql` in the current tracing span. | |||
fn record_sql_in_span(sql: &str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship between this and the run_stmt
in #10660?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're talking about the span created here:
risingwave/src/utils/pgwire/src/pg_protocol.rs
Lines 230 to 236 in 9dcad74
tracing::info_span!( | |
target: PGWIRE_ROOT_SPAN_TARGET, | |
"handle_query", | |
mode, | |
session_id, | |
sql = tracing::field::Empty, // record SQL later in each `process` call | |
) |
This replaces the old one, which was renamed to process_query_msg_one_stmt
after that PR. Pros:
- It's defined in the unified path for handling
Query
Parse
andExecute
, while the old one only handled (simple)Query
. - It's now not only used by traces, but also used to provide common info for the query logs later, including the slow ones.
risingwave/src/utils/pgwire/src/pg_protocol.rs
Lines 312 to 316 in 9dcad74
tracing::info!( target: PGWIRE_QUERY_LOG, status = if result.is_ok() { "ok" } else { "err" }, time = ?elapsed, );
risingwave/src/utils/pgwire/src/pg_protocol.rs
Lines 288 to 292 in 9dcad74
tracing::info!( target: PGWIRE_SLOW_QUERY_LOG, ?elapsed, "slow query" );
As for recording the SQL field after creating the span: I just don't want to duplicate the code for extracting the SQL from the messages. This also reduces allocations.
// Query log. | ||
let fut = async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the consecutive async move
layers have performance implications? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not intuitively. 🤣 This is just like the adaptors for iterators or futures.
To be honest, this PR is more like a fix and a feature than a refactor.
If the refactor were not done, then the code for providing the common fields like New code introduced in this PR is scalable. For example, it requires minimal changes to
|
Signed-off-by: Bugen Zhao <[email protected]>
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Provide the common info like
mode
,session
,sql
with a tracing span in the entrance functiondo_process
, which...The slow query log was introduced in #6624 a long time ago. However, it has recently been discovered (in #13287 (comment) and also by @liurenjie1024) that it fails to take into account the time required to consume the data stream. This PR also fixes that by handling all these stuff in
pgwire
.Tested locally:
query.log
slow_query.log
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.